Chromosome-level genome assembly of the cereal cyst nematode Heterodera flipjevi

As an economically important plant parasitic nematode (PPN), Heterodera filipjevi causes great damage on wheat, and now it was widely recorded in many countries. While multiple genomes of PPNs have been published, high-quality genome assembly and annotation on H. filipjevi have yet to be performed. This study presents a chromosome-scale genome assembly and annotation for H. filipjevi, utilizing a combination of Illumina short-read, PacBio long-read, and Hi-C sequencing technologies. The genome consists of 9 pseudo-chromosomes that contain 134.19 Mb of sequence, with a scaffold N50 length of 11.88 Mb. In total, 10,036 genes were annotated, representing 75.20% of the total predicted protein-coding genes. Our study provides the first chromosome-scale genome for H. filipjevi, which is also the inaugural high-quality genome of cereal cyst nematodes (CCNs). It provides a valuable genomic resource for further biological research and pest management of cereal cyst nematodes disease.


Background & Summary
The cereal cyst nematodes (CCNs) are a group of 12 closely related species and considered to be one of the most damaging plant parasitic nematodes (PPNs) that limit production of cereal crops in many parts of the world including Australia, China, India, USA, Europe, North Africa and West Asia 1,2 .The species Heterodera filipjevi, H. avenae, and H. litipones are among the most economically important species and caused significant economic losses 3 .The yield losses caused by CCNs have been recorded from 10-35% on wheat in China, 24% on spring wheat in Oregon, and 20% on barley in Australia 4 .Among the CCNs, H. filipjevi is an important constraint to cereal crop production in different climatic regions 1 , and now it was widely recorded in many countries such as Tadzhikistan, Russia, Morocco, Tunisia, Pakistan, Libya, Turkey 5,6 , Estonia, Sweden, India, Norway, Iran, China 7 , United Kingdom, and USA 8 .Smiley et al. 4 reported a 35% yield loss in spring wheat in Oregon, USA, due to H. filipjevi 4 , and Karimipour et al. 9 estimated yield losses in wheat yield ranging between 20% and 25% in Iran by the same nematode species 9 .Also, Hajihasani et al. 10 reported that grain yield loss caused by H. filipjevi occurred even at the lowest population density and reached a maximum loss of 48% with an initial population density (Pi) of 20 eggs and J2/ (g soil) -1 in Iran 10 .
Genomic data have proven to be powerful tools to explain the successful parasitization of plant nematodes.The first plant-parasitic nematodes genomes Meloidogyne incognita and M. hapla have been published in 2008.Recently, several PPNs genomes from Globodera pallida 11 , Globodera rostochiensis 12 , Heterodera glycines 13,14 , Bursaphelenchus xylophilus 15 , Bursaphelenchus mucronatus 16 , Ditylenchus destructor 17 , M. floridensis 18 and M. graminicola 19 have been published.However, the available reference genome of CCNs was absent, only the transcriptome of H. avenae was sequenced using short-read sequencing technology [20][21][22][23] .In the present study, a total of 95.79 Gb (711.56 x) raw data was obtained by Illumina, PacBio, 10x Genomics and Hi-C technologies, the detailed sequencing data were summarized in Table 1.The 17-mers were counted as 17,119,184,513 from 21.8 G Illumina short reads, and the k-mer depth was 124 (Table 1).Then, we used PacBio long-read, Illumina short-read, 10 × Genomics and Hi-C data to generate a high-quality chromosome-level reference genome for H.filipjevi.The genome assembly spanned 134.19 Mb with a scaffold N50 length of 11.88 Mb (Table 2).After chromosome-level anchoring, 9 chromosomes with a total length of 120 Mb (89.4% of the draft assembly) were constructed corresponding to the karyotype.In addition, the mapping rate of Illumina short reads was 93.14% and the genome coverage was 99.71% (Table 3).Moreover, 2,226 homozygous single-nucleotide polymorphisms (SNP) and a low homozygous rate (0.0032%) were identified, suggesting a low error rate in the assembly (Table 4).In conclusion, our evaluation indicated a high quality of the assembled H. filipjevi genome.Finally, we annotated 13,352 protein-coding genes in H. filipjevi genome with a mean of 8.14 exons per gene (Table 5 and Table 6) and found 61.9 Mb (46.14%) repeat elements.The reference genome obtained in this study will provide a foundation for future investigations on the pathogenesis of CCNs.

Methods
Nematode sample and DNA extraction.The cysts of H. filipjevi were collected from wheat fields in Xuchang city Henan province.Ten cysts were chosen and inoculated on susceptible wheat cultivars Wenmai 19 in greenhouse for 6 generations.The fresh, healthy and unbroken cysts were manually picked and used for extraction of eggs by crushed in sterile water.the eggs were subsequently collected using the sucrose flotation technique 24 and cleaned with sterile distilled water for three times.Six developmental stages of H. flipjevi including pre-parasitic second stage juveniles (Pre-J2), parasitic-J2 (Para-J2), third stage juveniles (J3), fourth stage juveniles 4 (J4), adult females (Fes) and eggs were collected according to the previous report 20 .
DNA isolation and sequencing.Genomic DNA was isolated from H. filipjevi egg mass according to the CTAB method.The DNA quality and concentration were assessed using agarose gel electrophoresis and a Qubit  2.0 Fluorometer (Life Technologies, CA, USA).A 20 kb insert sizes library was produced following the manufacturer's protocol (PacBio, CA) and sequenced with the PacBio RS technology.For short-read sequencing, libraries with 350 bp insert sizes was prepared and sequenced on Illumina HiSeq 2500 as 2 × 150 bp reads (Table 1).The GEM (Gel Beads in Emulsion) reaction was conducted for 10 × Genomics sequencing using approximately 1 ng input DNA of 50 kb length, and 16 barcodes were introduced into droplets, subsequence, the droplets were fractured and 600 bp fragments were used for constructing libraries, which were sequenced on the Illumina HiSeq X platform at the Novogene Bioinformatics Institute, Beijing.
Genome size estimation, assembly and evaluation.For survey analysis, the H. filipjevi genome size was estimated using the 21.8 Gb paired-end Illumina sequencing data, based on the K-mer formula: Genome size = (total number of 17-mer) / (position of the homozygous peak).With the 14.74 Gb long reads generated from PacBio Sequel platform, the contig assembly of H. filipjevi genome was conducted using the FALCON assembler (version 1.2.4) 25 .Then, the assembly from PacBio data was polished by Quiver (smrtlink 5.0.1) 26 .The heterozygosity of assembly was removed by using the Purge Haplotigs software (version 1.1.1) 27.The resulting contigs were connected to super-scaffolds by 42.58 Gb 10 × Genomics linked-read data using the fragScaff software (Version 140324) 28 (Table 1).Finally, the short reads from Illumina were used to correct any remaining errors by Pilon (Version 1.22) 29 .These processes would yield a final draft H. filipjevi genome.
To acquire a high-quality H. filipjevi genome, the draft assembly was further improved using Hi-C analysis with 16.67 Gb Hi-C data.Firstly, the Hi-C reads were mapped to the draft assembly by using BWA 30 .Then, the low-quality reads and duplications were removed to build raw inter/ intra-chromosomal contact maps.Last, based on the agglomerative hierarchical clustering algorithm 31 , Lachesis (Version 201701) was applied for clustering, ordering and orienting, and the scaffolds from genomics were clustered into 9 pseudochromosomes 32 .Finally, Juicebox software (Version 2.20.00) was used to manually correct scaffolded chromosomes and plot heatmap of genomic interactions 33 .Above all, we obtained a 134,189,547 bp H. filipjevi genome including 9 pseudo-chromosomes, covering ~89.4% of the whole genome (Fig. 1) and 652 supper-scaffolds, the contig N50 and scaffold N50 are 0.45 Mb and 11.88 Mb, respectively (Table 2).Circos (version 0.64) was used to visualize the H. filipjevi genome data 34 .The completeness of genome assembly was assessed by the following methods.First, the Core Eukaryotic Genes Mapping Approach (CEGMA) 35 was conducted based on a core gene set involved in 248 evolutionarily conserved genes from six eukaryotic model organisms.The CEGMA evaluation results showed that 248 CEGs assembled 230 genes, with a proportion of 92.74%, indicating that the assembly results were relatively complete.Second, the BUSCO 36 (version 5.4.3) at genome model was used to evaluate the completeness of genomes in this study using nematoda_odb10 as a database.And we obtained a 55.8% assembly completeness, similar to other reported cyst nematode genomes (43.4-59.3%) 19.Finally, small fragment library reads were aligned to the assembled genome using BWA software the alignment rate of the total small fragment reads to the genome is about 93.14% and the coverage is about 99.71%, indicating a good consistency between the reads and the assembled genomes (Table 3).

Genomic repeat annotation.
Two technologies were applied to the annotation of repetitive sequences within H. filipjevi genome, including homologous comparison and ab initio prediction.For homologous comparison, RepeatMasker (Version 3.3.0)and the associated RepeatProteinMask were performed by aligning against Repbase database 37 .For ab initio prediction, LTR_FINDER (version 1.0.7) 38, RepeatScout (Version 1.0.5) 39and RepeatModeler (Version 1.0.4) were firstly used for de novo candidate database construction of repetitive elements.Followed by, the repetitive sequences were annotated using RepeatMasker, while the tandem repeats were ab initio predicted using TRF (Version 4.07b) 40 .By combining Repbase and de novo datasets, we obtained a total of 61.91 Mb of consensus and nonredundant repetitive sequences, which occupied 46.14% of the genome (Table 5).
In addition, the gene structures of transfer RNAs (tRNA), ribosomal RNAs (rRNA) and other non-coding RNAs in H. filipjevi genome were predicted.Specifically, the tRNA were predicted using tRNAscan-SE software (version 1.3.1) 65.The rRNA fragments were predicted by searching against invertebrate rRNA database using BLAST with an E-value of 1E −10 .The microRNAs (miRNA) and small nuclear RNAs (snRNA) genes were predicted by INFERNAL (version 1.1.1) 66using Rfam database 67 .
The predicted protein-coding genes in H. filipjevi genome were functionally annotated based on homologous searches against databases of SwissProt 68 , NR database (NCBI) 69 , Gene Ontology 70 , InterPro 71 and KEGG pathway 72 .Notably, InterproScan tool 73 in coordination with InterPro database 74 were applied to predict protein function based on the conserved protein domains and functional sites.A total of 10,036 genes (75.20%) were successfully annotated by at least one public database (Fig. 3).

Data records
The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive 75 in National Genomics Data Center (NGDC) 76 , China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA014195) 77 that are publicly accessible at https://ngdc.cncb.ac.cn/gsa.The genome assembly has been deposited in DDBJ/ENA/GenBank under the accession number JBDPZO000000000 78 , and NGDC under the GSA accession CRA015002 79 .Data of the gene functional annotations had been deposited at Figshare 80 .

Technical Validation
Nucleic acid quality.The DNA quality and concentration were assessed using agarose gel electrophoresis and a Qubit 2.0 Fluorometer (Life Technologies, CA, USA).evaluation of genome assembly.Various different strategies were used to evaluate the completeness and accuracy of the H. filipjevi genome.First, our assembly genome was verified to have high completeness by CEGMA 35 (92.74%), indicating that the assembly results are relatively complete.Second, the BUSCO 36 (v5.4.3) at genome model was used to evaluate the completeness of genomes in this study and other published genome, using nematoda_odb10 as a database.13][14]19 .The low completeness of the BUSCO estimates can be attributed to the substantial genetic divergence between the nematoda_odb10 database and cyst nematodes, with large differences in protein sequences.Moreover, to evaluate the accuracy of the assembly, we used BWA software to align small fragment library reads to the assembled genome to calculate the alignment rate, coverage degree and depth of reads.The results show that the alignment rate of the total small fragment reads to the genome is about 93.14% and the coverage is about 99.71%, indicating a good consistency between the reads and the assembled genomes (Table 3).

Fig. 2
Fig. 2 The composition of gene elements in the H. filipjevi genome to other species.(a) CDS length distribution and comparison with other species.(b) Exon length distribution and comparison with other species.(c) Exon number distribution and comparison with other species.(d) Gene length distribution and comparison with other species.(e) Intron length distribution and comparison with other species.

Table 1 .
Sequencing data used for the genome Heterodera flipjevi assembly.

Table 2 .
The statistics of length and number for the de novo assembled Heterodera flipjevi genome.

Table 3 .
Statistics of reads coverage of the Heterodera flipjevi genome.

Table 4 .
SNP statistics of the Heterodera flipjevi genome.

Table 5 .
Statistical results of the repetitive sequences of the Heterodera flipjevi genome.

Table 6 .
Gene annotation of Heterodera flipjevi genome via three methods.Note that CDS refers to coding sequence; GlimmerHMM was a new genefinder based on a Generalized Hidden Markov Model (GHMM); SNAP refers to Semi-HMM-based Nucleic Acid Parser; EVM refers to Evidence modeler.